Do Linguistic Style and Readability of Scientific Abstracts Affect their Virality? 
Abstract

Reactions to textual content posted in an online social net- 
work show different dynamics depending on the linguistic 
style and readability of the submitted content. Do similar dy- 
namics exist for responses to scientific articles? Our intuition, 
supported by previous research, suggests that the success of 
a scientific article depends on its content, rather than on its 
linguistic style. In this article, we examine a corpus of sci- 
entific abstracts and three forms of associated reactions: ar- 
ticle downloads, citations, and bookmarks. Through a class- 
based psycholinguistic analysis and readability indices tests, 
we show that certain stylistic and readability features of ab- 
stracts clearly concur in determining the success and viral ca- 
pability of a scientific article. 



Introduction 

The generic term virality refers to the tendency of informa- 
tion to spread quickly and widely in a community by word- 
of-mouth processes. Analyzing and recognizing such forms 
of persuasive communication is of paramount importance in 
many theoretical and applied contexts. For example, what 
determines whether content posted and shared on Facebook, 
Digg, Google+, or Twitter will go viral or not?

We agree with the view that virality hinges primarily on the
nature of the content being spread (Berger and Milkman 2009;
Guerini, Strapparava, and Ozbal 2011). Yet, when analyzing
text-rich contexts, another important component may
contribute to the viral potential of information: its linguistic 
style. Textual snippets posted on social network sites, e.g., 
status updates and tweets, may be prone to receive more 
attention than others based not only on their content but 
also on how they are written (Quercia et al. 2011). Can
we assume similar dynamics when analyzing virality of 
scientific articles? 

In this paper, we analyze virality in terms of a com- 
munity's response to a scientific article measuring the vol- 
ume of downloads, bookmarks, and citations it receives. 



These three indicators are telling of the extent of pene- 
tration of a scientific article in a given scientific commu- 
nity along three different tangents. Citations in the schol- 
arly record certainly represent the most widely employed 
and accepted measure of validity and visibility in science. 
Yet, due to the lengthy time frames of academic pub- 
lishing, citations are normally accrued relatively slowly. 
Readership, as measured in the total number of clicks or
downloads a paper receives, is the most direct and immediate
yardstick for visibility. Several readership and usage
measures have been tested and discussed in the literature
(see Kurtz and Bollen 2010 for a general review). Bookmarking
is another readily available indicator of visibility.
Websites such as CiteULike (www.citeulike.org)
allow users to store, organize and share links to aca- 
demic papers. These novel measures of impact — down- 
loads, social bookmarks, and social media responses — 
are increasingly being adopted by bibliographic services 
and promise to play an important role in academic evaluation
in the near future (Li, Thelwall, and Giustini 2011;
Shuai, Pepe, and Bollen 2012).



Preprint. Proceedings of the Sixth International AAAI Conference 
on Weblogs and Social Media (ICWSM 2012). 



Text-based investigations of scientific virality have recently
appeared in the literature. Routledge and Smith (2011)
analyze corpora of abstracts and fulltexts from different
communities. They consider downloads and within-community
citations as response indicators to articles and
use generalized linear models to predict them. Their results 
show that textual features significantly improve accuracy of 
virality predictions over metadata such as authors, topic, and 
publication venues. We take this finding as the starting
point for our analysis. Rather than focusing on the
task of predicting responses, we try to model the non-topical
features (i.e. language style and readability) of a viral text,
considering only the abstract of the paper. As such, we assume
that virality is triggered mainly by the abstract of the
scientific article. This is a fair assumption considering that 
the abstract is, by and large, the main vehicle of scientific 
dissemination and circulation in online digital platforms. 

Our approach can be explained in light of a "rapid cognition"
model (Ambady and Rosenthal 1992; Kenny 1994).
In this model, the user has to decide in a limited amount of
time whether to download, bookmark, and/or cite a paper.
In order to make a decision, she exploits cues which are not
directly related to the content of the paper, such as its
readability and writing style, e.g. whether the text is presented
in an assertive way, using self-centered pronouns such as
"we". In some respects, the rapid cognition model is remi- 
niscent of the mechanisms by which humans routinely make 
judgments about strangers' personality and behavior from 
very short behavioral sequences and non-verbal cues. Those 
intuitions, based on so-called "thin slices" of behavior, the 
process they come by, and their effectiveness in producing 
precise judgments on individuals' or groups' properties (e.g.
personality, teaching capabilities, negotiation outcome) have 
been subject to extensive investigation by social psychologists
(Kenny 1994).



Dataset. Our analysis is based on a corpus of articles in the 
field of physics and astronomy published in the last decade. 
The corpus is obtained from the NASA Astrophysics Data 
System (ADS), a complete database of physics and astron- 
omy literature with a user base which includes virtually ev- 
ery researcher in astrophysics and related disciplines. For 
each paper in this corpus, we have the following information:
the text abstract of the paper, the number of times
it is downloaded on the ADS website, the number of times
it is cited in the literature, and the number of times it is
bookmarked on the CiteULike website. From this bibliographic
corpus we extract three balanced collections of "viral papers":
(1) the most cited papers (number of cites > 350),
(2) the most downloaded papers (downloads > 330), and
(3) the most bookmarked papers (bookmarks > 8). An additional
collection is also created, containing a random selection
of non-viral papers (i.e. papers that did not score highly on
the three indicators above), to be used as a ground comparison.
Each one of these collections contains roughly 3,000
abstracts. The completeness of the ADS database and its
wide adoption rate guarantee that (i) datasets are homogeneous
and comparable and (ii) findings about language style, if
any, can be traced back to the viral properties of the abstracts
and not to the over-representation of specific communities
in one medium.
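
As an illustration of the selection step described above, the following is a minimal sketch in Python. The `Paper` record and the `split_collections` helper are hypothetical names introduced here; only the three thresholds come from the text. The rule used for the control pool (a paper exceeding none of the thresholds) is an assumption, since the exact criterion for non-viral papers is not spelled out.

```python
from dataclasses import dataclass

# Thresholds quoted in the Dataset paragraph.
MIN_CITES = 350
MIN_DOWNLOADS = 330
MIN_BOOKMARKS = 8


@dataclass
class Paper:
    """Hypothetical record for one ADS entry."""
    abstract: str
    cites: int
    downloads: int
    bookmarks: int


def split_collections(papers):
    """Return (most_cited, most_downloaded, most_bookmarked, control_pool)."""
    most_cited = [p for p in papers if p.cites > MIN_CITES]
    most_downloaded = [p for p in papers if p.downloads > MIN_DOWNLOADS]
    most_bookmarked = [p for p in papers if p.bookmarks > MIN_BOOKMARKS]
    # Assumption: a control candidate exceeds none of the three thresholds.
    control_pool = [
        p for p in papers
        if p.cites <= MIN_CITES
        and p.downloads <= MIN_DOWNLOADS
        and p.bookmarks <= MIN_BOOKMARKS
    ]
    return most_cited, most_downloaded, most_bookmarked, control_pool
```

A random sample of roughly 3,000 abstracts could then be drawn from `control_pool`, for instance with `random.sample`.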

We employ these four datasets to perform two differ- 
ent analyses: (1) a class-based psycholinguistic analysis and 
(2) a readability indices test. The features extracted in the 
first analysis trace back to a number of psycholinguistic
attributes, e.g. the way information is presented, the use of
personal rather than impersonal references to the work, the use
of time-related verb forms, and so on. With the second analysis,
we measure the readability of the abstracts, i.e., how
difficult it is to understand their language. We demonstrate
that there are important features, not directly connected with 
the content of a paper, which concur in determining its suc- 
cess. 

Class-based psycholinguistic analysis 

To explore the characteristics of viral texts, we em- 
ploy a class-based psycholinguistic analysis of text 
which can be adapted to studies of social contagion 
(Mihalcea and Strapparava 2009). We calculate a score
associated with a given class of words, as a measure of saliency
for the given word class inside the collection of most cited, 
downloaded and bookmarked articles. 

Given a class of words $C = \{W_1, W_2, \ldots, W_N\}$, we define
the class coverage in the viral abstract collection $A$ as
the percentage of words from $A$ belonging to the class $C$:

$$\mathrm{Coverage}_A(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_A(W_i)}{\mathrm{Size}_A} \tag{COV}$$

where $\mathrm{Frequency}_A(W_i)$ represents the number of occurrences
of word $W_i$ inside corpus $A$, and $\mathrm{Size}_A$ represents
the total size (in words) of the corpus $A$. Similarly, we define
class $C$ coverage for the corpus of control abstracts $D$.
The dominance score of the class $C$ in the given corpus
$A$ is then defined as the ratio between the coverage of the
class in the examples set $A$ with respect to the coverage of
the same class in the corpus $D$:

$$\mathrm{Dominance}_A(C) = \frac{\mathrm{Coverage}_A(C)}{\mathrm{Coverage}_D(C)} \tag{DOM}$$



A dominance score higher than 1 indicates a class that 
is dominant in collection A. A score lower than 1 indicates 
a class that is unlikely to appear in collection A. We use 
the classes of words as defined in the Linguistic Inquiry and 
Word Count (LIWC), which was developed for psycholinguistic
analysis (Pennebaker and Francis 2001). LIWC includes
about 2,200 words and word stems grouped into
about 70 broad categories relevant to psychological pro- 
cesses (e.g., EMOTION, COGNITION). Sample words for 
relevant classes in our study are shown in Table 1.



LABEL      Sample words
CERTAIN    all, very, fact*, exact*, certain*, completely
NEGATE     not, no, zero, without, never
DISCREP    but, if, expect*, should
TENTAT     or, some, may, possib*, probab*
SENSES     observ*, discuss*, shows, appears
SELF       we, our, I, us
SOCIAL     discuss*, interact*, suggest*, argu*



Table 1: Word categories along with a sample of correspond- 
ing most frequent words in the datasets 
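
The coverage and dominance scores defined above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the study: the tokenizer, the handling of `*` as a stem wildcard, and the function names are assumptions, and the tiny SELF class below only mirrors the sample words of Table 1 rather than the full LIWC lexicon.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumed, simplistic choice)."""
    return re.findall(r"[a-z']+", text.lower())


def matches(word, entry):
    """Entries ending in '*' are treated as stems, others as exact words."""
    entry = entry.lower()
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry


def coverage(abstracts, word_class):
    """Fraction of tokens in `abstracts` that belong to `word_class`."""
    counts = Counter(tok for text in abstracts for tok in tokenize(text))
    size = sum(counts.values())
    if size == 0:
        return 0.0
    hits = sum(n for word, n in counts.items()
               if any(matches(word, entry) for entry in word_class))
    return hits / size


def dominance(viral, control, word_class):
    """Ratio of class coverage in the viral set to that in the control set."""
    control_coverage = coverage(control, word_class)
    if control_coverage == 0:
        raise ValueError("class never occurs in the control collection")
    return coverage(viral, word_class) / control_coverage


SELF = ["we", "our", "I", "us"]  # sample words only, not the full class
viral = ["We show that our model works"]
control = ["The model is shown to work by us"]
score = dominance(viral, control, SELF)
```

Because dominance is a ratio of two coverages, it does not matter whether coverage is expressed as a fraction or as a percentage.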

Results and Discussion. Tables 2 and 3 show the top-ranked
classes along with their dominance scores. In the following, we
cluster these classes according to macro-categories that
emerged from the analysis. To keep only significant results,
we discard dominance scores that fall between 0.8 and 1.2,
as proposed by Mihalcea and Strapparava (2009).
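
The cutoff can be read as a simple filter on dominance scores. The sketch below applies it to three rows copied from Table 3; the helper names are hypothetical, and treating the bounds 0.8 and 1.2 as exclusive is an assumption, since the text does not say how boundary values are handled.

```python
LOW, HIGH = 0.8, 1.2  # cutoff bounds quoted in the text


def is_significant(score):
    """A score counts as significant only outside the [0.8, 1.2] band."""
    return score > HIGH or score < LOW


# Three rows copied from Table 3.
TABLE3_EXCERPT = {
    "ACHIEVE": {"Downl": 0.79, "Bookm": 0.98, "Cite": 0.91},
    "ASSENT": {"Downl": 1.58, "Bookm": 0.71, "Cite": 1.23},
    "COMM": {"Downl": 1.04, "Bookm": 1.93, "Cite": 1.95},
}


def significant_datasets(table):
    """Map each word class to the datasets where its score is significant."""
    return {
        word_class: [name for name, score in row.items() if is_significant(score)]
        for word_class, row in table.items()
    }


result = significant_datasets(TABLE3_EXCERPT)
```

Under this reading, ACHIEVE is significant only for the most downloaded papers, which matches the discussion of specialized virality.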

Basically, our approach consists in counting words in psy- 
chologically meaningful categories. The LIWC was created 
for spontaneous, personal language production. Since we
are analyzing scientific texts (non-spontaneous, by definition),
we ruled out those categories that are more focused
on content than on style (e.g. RELIGION, MUSIC), because
their relevance can be connected to the polysemy of





           Downl   Bookm   Cite
CERTAIN    1.58    1.51    1.65
DISCREP    1.88    1.94    1.71
EXCL       1.51    1.80    1.26
FUTURE     1.25    1.40    1.54
NEGATE     1.33    1.42    1.33
OTHREF     3.06    2.77    1.62
PAST       0.40    0.64    0.53
PRONOUN    2.19    2.09    1.48
SELF       3.56    2.93    1.82
SENSES     1.53    1.23    1.32
SIMILES    1.30    1.21    1.54
SOCIAL     1.94    2.64    1.63
TENTAT     1.36    1.76    1.56
WE         3.70    3.07    1.84



Table 2: Dominant Word Classes in all three datasets 



the corresponding words rather than a presence of the cate- 
gory itself in the abstract. As an example, consider the words 
"disk", "radio*", "band", "instrument*" from the LIWC 
MUSIC category: in the physics and astronomy field these 
words have a completely different meaning. 

Categories of Basic Virality. We begin by analyzing 
those categories that are dominant in all three datasets,
from Table 2. These categories represent a basic form of
virality, common to all datasets.

Certainty Dimension. We found a significant dominance 
of categories describing cognitive processes (in particular 
the style of presentation of a given content). In their
abstracts, viral papers tend to present information in
polarized forms. On the one hand, a more assertive
language (CERTAIN) is found, also in the negative form
(NEGATE). On the other hand, certainty is mitigated by
showing discrepancies between what was expected and what
was actually found (DISCREP) and by highlighting the boundaries
of what the assertions cover (EXCL, e.g. but, except, without).
Interestingly, the assertive language is also mitigated by the
category expressing tentative standpoints (TENTAT).

Time-related Dimension. With regard to time-related lan- 
guage style we see a positive correlation with verbs in the fu- 
ture form (FUTURE) and a negative correlation with verbs 
in the past form (PAST). 

Self-centered Dimension. Viral articles are usually pre- 
sented in a personal rather than impersonal way, not only in 
the general use of pronouns (PRONOUN) but specifically 
in the use of self-centered pronouns representing the
researchers (SELF, and in particular WE).

Sense-related and other Dimensions. Finally, in viral papers
we observe a tendency to describe the work through
sense-related rather than abstract verbs (SENSES), to use
similes (SIMILES, e.g. like), and to use terms related to
social interaction (SOCIAL).

Categories of Specialized Virality. We also analyze those 
categories that are dominant in only some datasets or that are 
representative of a specific dataset. Results are summarized
in Table 3.

Certainty Dimension. Frequently downloaded papers use
terms related to achievements (ACHIEVE) less often and
terms in the ASSENT category (agree*, indeed, accepta*)
more often, when compared to the control dataset. In general,
the most bookmarked dataset is the only one having a positive
correlation with the macro-class of cognitive mechanisms
(COGMECH), due to the further correlation with INHIB
and INSIGHT.

Time-related Dimension. Only most bookmarked articles 
show a positive correlation with verbs in the present form. 

Self-centered Dimension. Most downloaded and cited articles
also tend to use more often self-centered pronouns
representing the researcher in the first person (I), while most
downloaded and most bookmarked papers tend to compare
with other researchers' work (OTHER - their*, they, them).

Sense-related and other Dimensions. We notice that the 
use of sense-related verbs diverges on the specific senses 
when considering the single viral phenomena (SEE, HEAR, 
FEEL). The use of terms related to social interaction (SO- 
CIAL) is further specialized in verbs concerning communi- 
cation (COMM) in the most bookmarked and cited datasets. 





           Downl   Bookm   Cite
ACHIEVE    0.79    0.98    0.91
ASSENT     1.58    0.71    1.23
COGMECH    1.04    1.28    1.06
ARTICLE    0.93    0.77    1.03
COMM       1.04    1.93    1.95
FEEL       0.35    0.99    1.12
HEAR       0.72    1.20    2.04
SEE        1.91    1.25    1.15
I          1.76    1.12    1.65
OTHER      1.68    2.09    1.16
INHIB      1.00    1.39    1.21
INSIGHT    0.97    1.22    1.00
PRESENT    1.04    1.28    1.25



Table 3: Dominant Word Classes in some datasets 



Readability Index Tests 

We further analyzed the abstracts in the three datasets ac- 
cording to readability indices, to understand whether there 
is a difference in the language difficulty among them.
Basically, the task of readability assessment consists in
quantifying how difficult a text is for a reader. This kind of
assessment has been widely used for several purposes, such as
evaluating the reading level of children and impaired persons 
and improving Web content accessibility. 

We use two indices to compute the difficulty of an abstract:
the Gunning Fog (Gunning 1952) and the Flesch indices
(Flesch 1946). These metrics combine factors, such as
word and sentence length, that are easy to compute and
approximate the linguistic elements that affect readability.

The Fog index is a rough measure of how many years 
of schooling it would take someone to understand the con- 
tent; higher scores indicate material that is harder to read. 
Texts requiring near-universal understanding have an index
less than 8. Academic papers usually have a score between 
15 and 20. 



The Flesch Index rates texts on a 100-point scale. Higher 
scores indicate material that is easier to read while lower 
numbers mark passages that are more difficult to read. 
Scores can be interpreted as follows: 90-100 for content easily
understood by an average 11-year-old student, and 0-30 for
content best understood by university graduates.
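
For reference, both indices can be sketched in Python using their commonly cited formulas: Fog = 0.4 × (words per sentence + 100 × share of words with three or more syllables) and Flesch = 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sentence splitter and the vowel-group syllable counter below are rough assumptions made for illustration; the tool actually used to compute the scores in this study is not specified, so absolute values may differ.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def split_text(text):
    """Return (sentences, words) using simple regular expressions."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return sentences, words


def gunning_fog(text):
    """Gunning Fog index: higher means harder to read."""
    sentences, words = split_text(text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def flesch_reading_ease(text):
    """Flesch reading ease: higher means easier to read."""
    sentences, words = split_text(text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

The two indices move in opposite directions: a harder text raises the Fog index and lowers the Flesch index, which is the pattern to keep in mind when reading the readability results.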





           Fog-index          Flesch-index
           μ         σ        μ         σ
Bookm      21.02*    3.37     8.77*     14.44
Cites      19.83†    4.03     15.80†    15.10
Downl      18.22*    3.86     25.86*    13.48
Control    19.95     4.18     14.80     15.96



Table 4: Averaged readability indexes for the various
datasets. * means a statistically significant difference at α
< 0.001, † means no statistically significant difference, with
respect to the control dataset. T-test used.
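
The significance marks in Table 4 can be roughly checked from the reported summary statistics alone. The sketch below is an illustration under stated assumptions: it uses Welch's unequal-variance t statistic (the text only says a t-test was used), takes n = 3,000 per collection (the text says "roughly 3,000"), and approximates the two-sided p-value with the normal distribution, which is reasonable only because the samples are large.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (large samples)."""
    return math.erfc(abs(t) / math.sqrt(2))


N = 3000  # assumed size of every collection
CONTROL_FOG = (19.95, 4.18)  # (mean, standard deviation) from Table 4
FOG = {
    "Bookm": (21.02, 3.37),
    "Cites": (19.83, 4.03),
    "Downl": (18.22, 3.86),
}

p_values = {
    name: two_sided_p_normal(
        welch_t(mean, sd, N, CONTROL_FOG[0], CONTROL_FOG[1], N))
    for name, (mean, sd) in FOG.items()
}
```

With these assumptions, the Fog-index differences for the most bookmarked and most downloaded papers fall well below 0.001, while the difference for the most cited papers does not, in agreement with the marks in the table.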



Results and Discussion. As expected, all abstracts have
high-difficulty readability scores (see Table 4). Interestingly,
while most cited papers have scores that are not statistically
different from baseline papers, most bookmarked
papers have abstracts that are harder to read and most downloaded
papers have abstracts that are easier to read. Furthermore, the
standard deviation tends to diminish in most-bookmarked and
most-downloaded papers, indicating that these classes tend
to converge in readability difficulty (F-test, α < 0.001).

These results suggest different practices/uses associ- 
ated with the various datasets, in line with the assump- 
tion that virality is a phenomenon with many facets 
(Guerini, Strapparava, and Ozbal 2011). These practices can
possibly be interpreted as steps of a process that goes from
initial interest/curiosity for an article to the final decision of 
citing it: (i) the most downloaded papers are those that are
easier to read and probably get more initial attention and
understanding; (ii) on the contrary, the most bookmarked
are those that need a deeper understanding and so are "put
in the stack" to be analyzed later on; (iii) finally, being
cited is much less connected to readability (indicating that
what matters in the end is the style and content of the
abstract/paper).



Conclusions 

In this paper we argued that responses to scientific articles 
are influenced by the linguistic style and readability of their 
abstracts. Through a psycholinguistic analysis and readabil- 
ity tests, we showed that linguistic style of abstracts concurs 
in determining the success of a scientific article. Based on 
these findings, we modified the initial abstract of the present
paper so as to meet the virality criteria of Table 2 (each key
modification is shown in brackets as [original wording →
revised wording]):

Reactions to textual content posted in an online social network
show different dynamics [hinging → depending] on the linguistic
style and readability of the submitted content. Do similar
dynamics exist for responses to scientific articles?
Our intuition, supported by previous research, [says → suggests]
that the success of a scientific article depends on its content,
rather than on its linguistic style. In this article, [a corpus
of scientific abstracts and three forms of associated reactions
is examined → we examine a corpus of scientific abstracts and
three forms of associated reactions]: article downloads,
citations, and bookmarks. Through a class-based psycholinguistic
analysis and readability indices tests, we [argue → show] that
certain stylistic and readability features of abstracts clearly
concur in determining the success and viral capability of a
scientific article.

The final version of the abstract showed a significant dom- 
inance on 72% of the Word Classes presented in Table 2 
(57% before the modification) and its readability scores (un- 
changed) are 18.81 (Fog-index) and 22.57 (Flesch-index). 